Drain LibMR threads to a safe point before fork() (MOD-15307) #97
gabsow wants to merge 7 commits
Add MR_DrainForFork()/MR_ResumeAfterFork(). On the main thread (from the module's FORK_CHILD_PRE handler) park the event-loop thread at a between-tasks safe point via a posted task and bounded-wait the worker pool to idle, so no LibMR thread holds a libc lock at fork() (ghost-lock). Bounded + fail-open; cooperative (not the SIGUSR2 mr_thpool_pause, which can freeze a worker mid-malloc). Resume releases the parked event-loop thread. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Logs the time spent quiescing the threads (excluding the fork that follows), plus whether the event-loop thread parked and the busy-worker count, so the pre-fork drain cost can be measured. Debug level so it is silent in production. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a multi-shard execution hits max-idle (EXECUTION_DEFAULT_MAX_IDLE_MS), the coordinator only logged a fixed "execution max idle reached" string with no clue which peer stalled. Now, at timeout, log the non-responding peer shard id(s) and endpoint(s), replies received vs expected, the elapsed wait, and time since last progress — at warning level so it is captured at the default loglevel (it coincides with the user-visible failure and is rate-bounded by nMaxIdleReached). Track responders by recording each ACK / NOTIFY_DONE sender node-id into a heap-strings set on the Execution (freed in MR_FreeExecution, the single final owner); pending = cluster peers minus that set, formatted by a new MR_ClusterFormatPendingPeers helper. Dispatch time is stamped and logged at debug. All new state is touched only on the event-loop thread, so no extra locking. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MR_DrainForFork only waited for mr_thpool_num_threads_working() to reach 0, ignoring jobs that are queued but not yet picked up by a worker. Such a job could be dequeued and start running (and allocating) immediately after the check, right as fork() runs -- defeating the drain. Add mr_thpool_num_jobs_in_queue() and wait for BOTH the working count and the queue to reach 0 (the same invariant thpool_wait uses). Addresses the bugbot "fork drain ignores queued jobs" finding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit f6fe687.
mr_thpool_num_jobs_in_queue() read jobqueue.len without holding jobqueue.rwmutex, while jobqueue_push/pull mutate it under that lock -- a data race that could make MR_DrainForFork see a stale (zero) queue and fork early. Take rwmutex for the read. Addresses the bugbot "unlocked job queue length read" finding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@galcohen-redislabs wanted your opinion on the last commit ( Two things I am not sure about:
Happy to drop the lock and instead annotate the read as a benign race (matching |
 * `responded` set into "id(ip:port),..." in `out`, so a max-idle timeout can
 * name the non-responding shard(s). Runs on the event-loop thread (same as the
 * message handlers), so reading the cluster node table needs no extra locking.
 * The node-id is the same NUL-terminated string used as the responded-set key. */
The whole comment is too specific: no need to specify the ticket number and no need to explain the implementation and cause for this specific issue.
The new diagnostics (more explicit names of workers that didn't respond after a timeout, etc.), including the locking mechanism and the monotonic clock they need, are not really related to the core issue in this ticket (i.e., the fork coordination between the module and the Redis core).
The whole thing is just code slop and should not be included in this PR. If we really want such code (only if it is really a big issue, which honestly I don't think it is...), then we should open a specific ticket for it.
    Node* n = mr_dictGetVal(entry);
    if (n->isMe) continue;
    if (responded && mr_dictFind(responded, n->id)) continue;
    int w = snprintf(out + off, outLen - off, "%s%s(%s:%u)",
The size_t math is dangerous. As a rule of thumb, you don't want to subtract two unsigned values, since the result can wrap around to a huge number.
Better to add an explicit guard here and change the comparison below to use additions only.
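To illustrate the guard suggested above, here is a minimal sketch of a bounded append helper. The name `append_fmt` and its return convention (returning `cap` on truncation) are assumptions made for this example and are not taken from the PR; the only point is that the subtraction happens after an explicit `off >= cap` check and that the truncation test is written with an addition.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of LibMR): append formatted text to
 * out[0..cap) starting at offset `off`.
 * Returns the new offset, or `cap` if the output did not fit. */
static size_t append_fmt(char *out, size_t cap, size_t off, const char *fmt, ...) {
    /* Explicit guard: without it, cap - off could wrap to a huge size_t. */
    if (cap == 0 || off >= cap) return cap;

    va_list ap;
    va_start(ap, fmt);
    /* cap - off is safe here because the guard established off < cap. */
    int w = vsnprintf(out + off, cap - off, fmt, ap);
    va_end(ap);

    if (w < 0) {              /* encoding error: keep the previous contents */
        out[off] = '\0';
        return off;
    }
    /* Addition-only comparison. off < cap and w <= INT_MAX, so the sum
     * does not wrap for any realistic buffer size. */
    if (off + (size_t)w >= cap) return cap;   /* truncated, still NUL-terminated */
    return off + (size_t)w;
}
```

A caller can chain appends and treat a return value equal to `cap` as "buffer full", so no later call ever computes a negative remaining size.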
size_t MR_ClusterGetSize();

/* MOD-14615: format peer shards (id(ip:port)) that are NOT in `responded` into
Here too: no need to include ticket number or implementation details or reasoning for doing so or apologies about the struct declaration...
(also in all other places; we don't need comments explaining how and why code was written to solve a specific ticket)
    REDISMODULE_NOT_USED(ctx);
    pthread_mutex_lock(&mr_forkDrainLock);
    mr_forkElParked = 1;
    pthread_cond_broadcast(&mr_forkDrainCond);
Why not use the already-existing sync. functions in utils?
Also: the broadcasting back and forth between ml and worker threads (with its extra locks and hidden locks, such as the one in mr_thpool_num_jobs_in_queue()) poses a threat of increasing latencies for all commands. I would really like a much simpler solution, if possible.
- Wait for the worker pool to drain via the pool's own bounded idle wait (new mr_thpool_wait_timeout, reusing thcount_lock/threads_all_idle) instead of a hand-rolled poll loop plus a per-read jobqueue lock. Removes the hot-path lock contention that risked added command latency.
- Remove the max-idle "which shard didn't reply" diagnostics entirely (responded-sets, monotonic timestamps, dispatch/timeout logging, MR_ClusterFormatPendingPeers, mr_thpool_num_jobs_in_queue) -- out of scope for the fork-coordination fix; will be tracked in a separate ticket if needed.
- Trim comments (no ticket numbers / implementation rationale).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
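For readers unfamiliar with the pattern, a bounded idle wait of the kind described in this commit could look roughly like the sketch below. It is not the actual mr_thpool_wait_timeout; the `demo_pool` struct, its fields, and the function names are assumptions made for this example. The waiter sleeps on a condition variable until both the working count and the queue length are zero, or until a deadline passes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical pool state; the real thread pool layout may differ. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_idle;  /* signalled when working == 0 && queued == 0 */
    int working;               /* jobs currently running */
    int queued;                /* jobs pushed but not yet picked up */
} demo_pool;

static void demo_pool_init(demo_pool *p) {
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->all_idle, NULL);
    p->working = 0;
    p->queued = 0;
}

/* Stand-in for the bookkeeping workers would do around each job. */
static void demo_pool_set(demo_pool *p, int working, int queued) {
    pthread_mutex_lock(&p->lock);
    p->working = working;
    p->queued = queued;
    if (working == 0 && queued == 0) pthread_cond_broadcast(&p->all_idle);
    pthread_mutex_unlock(&p->lock);
}

/* Returns 1 if the pool went idle before the deadline, 0 on timeout. */
static int demo_pool_wait_idle(demo_pool *p, long timeout_ms) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);   /* default condvar clock */
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&p->lock);
    while (p->working != 0 || p->queued != 0) {
        int rc = pthread_cond_timedwait(&p->all_idle, &p->lock, &deadline);
        if (rc == ETIMEDOUT) break;             /* bounded: stop waiting */
    }
    int idle = (p->working == 0 && p->queued == 0);
    pthread_mutex_unlock(&p->lock);
    return idle;
}
```

Because the waiter only takes the lock the pool already uses for its idle bookkeeping, no extra lock is added on the job push/pull path. A caller that gets 0 back can log a warning and proceed, which matches the fail-open behavior the PR describes.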
@galcohen-redislabs thanks for the thorough review, addressed all of it in
PTAL.
Summary
Adds MR_DrainForFork()/MR_ResumeAfterFork(). Called from the embedding module's pre-fork handler (main thread), MR_DrainForFork() parks the event-loop thread at a between-tasks safe point and bounded-waits for the worker pool to go idle, so no LibMR thread holds a libc lock at fork() (which would ghost-lock the child). MR_ResumeAfterFork() releases the parked event-loop thread (called after the fork, on success or cancel).
Cooperative, bounded, fail-open. Deliberately not the existing mr_thpool_pause (SIGUSR2), which can freeze a worker mid-malloc holding the arena lock, which is exactly the ghost-lock this prevents. A worker still blocked acquiring the GIL is already malloc-safe, so a drain timeout there is benign.
Why / depends on
Fixes the RedisTimeSeries ASM-migration nightly hangs (MOD-14615 valgrind, MOD-14239 sanitizer). The embedding module wires this to redis core's new FORK_CHILD_PRE subevent (redis/redis#15327).
Pre-merge note
The two RedisModule_Log(..., "notice", ...) lines in MR_DrainForFork/MR_ResumeAfterFork should be downgraded to debug (kept at notice to confirm the drain fires during CI validation).
🤖 Generated with Claude Code
Note
Medium Risk
Fork-time threading and allocator safety are sensitive; incomplete drain can still leave ghost locks, though behavior is bounded and fail-open with warnings.
Overview
Adds MR_DrainForFork() and MR_ResumeAfterFork() so embedding modules can quiesce LibMR immediately before fork().
MR_DrainForFork() posts a task that parks the event-loop thread at a between-tasks safe point, then bounded-waits (2s) for that park and for the execution worker pool to go idle via new mr_thpool_wait_timeout(). If quiescence is incomplete, it logs a warning and continues (fail-open). MR_ResumeAfterFork() unblocks the parked event-loop thread and must be paired with every drain (after fork success or cancel).
This is intentionally not mr_thpool_pause (SIGUSR2), which can stop workers mid-malloc and worsen ghost-lock risk.
Reviewed by Cursor Bugbot for commit df7cc09. Bugbot is set up for automated code reviews on this repo.
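The park/resume handshake described in the overview can be sketched with a mutex and a condition variable. This is an illustration only, not the PR's code: `park_gate`, `park_task`, `gate_drain`, and `gate_resume` are hypothetical names, and in this sketch the caller starts the parking thread directly instead of posting a task to an event loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical gate shared by the main thread and the event-loop thread. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int parked;   /* set by the event-loop thread once it is at the safe point */
    int resume;   /* set by the main thread to release the parked thread */
} park_gate;

static void gate_init(park_gate *g) {
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->cond, NULL);
    g->parked = 0;
    g->resume = 0;
}

/* Runs on the "event-loop" thread between tasks: announce the park,
 * then block until the main thread resumes it. */
static void *park_task(void *arg) {
    park_gate *g = arg;
    pthread_mutex_lock(&g->lock);
    g->parked = 1;
    pthread_cond_broadcast(&g->cond);
    while (!g->resume) pthread_cond_wait(&g->cond, &g->lock);
    g->parked = 0;
    pthread_mutex_unlock(&g->lock);
    return NULL;
}

/* Main thread, before fork(): wait up to timeout_ms for the park.
 * Returns 1 if parked, 0 on timeout (the caller proceeds anyway: fail-open). */
static int gate_drain(park_gate *g, long timeout_ms) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    int rc = 0;
    pthread_mutex_lock(&g->lock);
    while (!g->parked && rc != ETIMEDOUT)
        rc = pthread_cond_timedwait(&g->cond, &g->lock, &deadline);
    int parked = g->parked;
    pthread_mutex_unlock(&g->lock);
    return parked;
}

/* Main thread, after fork() succeeds or is cancelled: release the park. */
static void gate_resume(park_gate *g) {
    pthread_mutex_lock(&g->lock);
    g->resume = 1;
    pthread_cond_broadcast(&g->cond);
    pthread_mutex_unlock(&g->lock);
}
```

In this sketch, if the drain times out and the park task only runs later, it sees `resume` already set and returns immediately, so a late park cannot leave the thread stuck. A real implementation would also need to reset the flags before the next drain.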